Sliding Window Calculations on Streaming Data using the Kepler Scientific Workflow System

Authors

  • Sven Köhler
  • Supriya Gulati
  • Gongjing Cao
  • Quinn Hart
  • Bertram Ludäscher
Abstract

In many areas of science, unbounded (potentially infinite) data streams need to be processed continuously, e.g., to compute running aggregates or sliding window aggregates. One important example is the computation of Growing Degree Days (GDD) from a stream of temperature data, which provides a heuristic tool to predict plant development and the maturity of crops. The process of data acquisition, processing, storage, and presentation forms a scientific workflow, and scientific workflow systems have been developed to automate the execution of such workflows. The whole workflow is decomposed into its individual steps, represented by actors, which in turn are connected by channels that describe the flow of data. This workflow representation allows existing components to be reused in different workflows and, in principle, makes existing workflows easy to modify. In current streaming workflow designs in Kepler, data belonging to a particular time window is typically identified by counting data tokens on channels between actors. However, this token-counting approach works neither for windows of variable length nor for overlapping windows. In this paper, we address these limitations and present a new actor design with two incoming streams: a data stream ordered by timestamp, and a stream of aggregation windows ordered by their start time. We present a new Chunker actor that “stream-joins” the data from the first stream with the windows presented on the second stream, where windows represent aggregation intervals of variable length that may overlap in time. Windows containing the corresponding data are output as soon as they are completed, i.e., once timestamps in the data stream pass the end time of a window. We illustrate the approach with an improved GDD workflow based on our new Chunker actor.
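
The stream-join described in the abstract can be illustrated with a minimal Python sketch. This is not the Kepler Chunker actor itself. The names `Window`, `chunker`, and `gdd` are hypothetical, windows are assumed to be half-open intervals `[start, end)`, remaining open windows are assumed to be flushed when a finite input ends, and the GDD function assumes the common simple-averaging formula with a base temperature of 10 °C. None of these details are taken from the paper.

```python
from collections import namedtuple
from typing import Iterable, Iterator, List, Tuple

# Hypothetical window type: half-open interval [start, end) (an assumption).
Window = namedtuple("Window", ["start", "end"])


def chunker(
    data: Iterable[Tuple[float, float]],
    windows: Iterable[Window],
) -> Iterator[Tuple[Window, List[float]]]:
    """Join (timestamp, value) pairs, ordered by timestamp, with windows
    ordered by start time. A window is emitted once a timestamp reaches
    or passes its end time."""
    win_iter = iter(windows)
    next_win = next(win_iter, None)
    open_windows: List[Tuple[Window, List[float]]] = []

    for ts, value in data:
        # Open every window that has started by this timestamp.
        while next_win is not None and next_win.start <= ts:
            open_windows.append((next_win, []))
            next_win = next(win_iter, None)

        still_open = []
        for win, values in open_windows:
            if win.end <= ts:
                # The data stream has passed the end of this window.
                yield win, values
            else:
                values.append(value)
                still_open.append((win, values))
        open_windows = still_open

    # Assumption: on a finite input, flush windows that are still open.
    for win, values in open_windows:
        yield win, values


def gdd(temps: List[float], base: float = 10.0) -> float:
    """Growing Degree Days via the common simple-averaging formula
    max(0, (Tmax + Tmin) / 2 - base). The base of 10.0 is an assumption."""
    if not temps:
        return 0.0
    return max(0.0, (max(temps) + min(temps)) / 2 - base)
```

Because each open window keeps its own buffer, one data token can contribute to several overlapping windows, and windows of different lengths need no special handling. For example, `chunker([(0, 10.0), (1, 12.0), (2, 14.0), (3, 16.0)], [Window(0, 3), Window(2, 4)])` emits `Window(0, 3)` with `[10.0, 12.0, 14.0]` as soon as timestamp 3 arrives, and applying `gdd` to those values gives `2.0`.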

Related resources

Scientific Workflow Infrastructure for Computational Chemistry on the Grid

We present ongoing research in the Resurgence (RESearch sURGe ENabled by CyberinfrastructurE) project. This infrastructure shall enable the flexible combination of computational chemistry tools from a unified interface, with the focus on automated high-throughput processing. The implementation is based on the idea that the time-consuming parts of the calculations can be distributed onto computa...

Provenance Collection Support in the Kepler Scientific Workflow System

In many data-driven applications, analysis needs to be performed on scientific information obtained from several sources and generated by computations on distributed resources. Systematic analysis of this scientific information unleashes a growing need for automated data-driven applications that also can keep track of the provenance of the data and processes with little user interaction and ove...

Using Web Services and Scientific Workflow for Species Distribution Prediction Modeling

Species distribution prediction modeling plays a key role in biodiversity research. We propose to publish both species distribution data and modeling components as Web services and composite them into modeling systems using the scientific workflow approach. We build a prototype system using Kepler scientific workflow system and demonstrate the feasibility of the proposed approach. This study is...

Incorporating Semantics in Scientific Workflow Authoring

The tools used to analyze scientific data are often distinct from those used to archive, retrieve, and query data. A scientific workflow environment, however, allows one to seamlessly combine these functions within the same application. This increase in capability is accompanied by an increase in complexity, especially in workflow tools like Kepler, which target multiple science domains includi...

FDiBC: A Novel Fraud Detection Method in Bank Club based on Sliding Time and Scores Window

One of the recent strategies for increasing the customer’s loyalty in banking industry is the use of customers’ club system. In this system, customers receive scores on the basis of financial and club activities they are performing, and due to the achieved points, they get credits from the bank. In addition, by the advent of new technologies, fraud is growing in banking domain as well. Therefor...

Publication date: 2012